A PERFORMANCE OPTIMIZED APPROACH
FOR EFFICIENT NUMERICAL COMPUTATIONS 

FIELD OF THE INVENTION 

The present invention relates to the areas of computation and algorithms, and
specifically to digital signal processing ("DSP") and to digital logic, structures
and systems for performing DSP operations. In particular, the present invention
relates to a method and system for improving the efficiency of computational
processes, specifically multiply and accumulate ("MAC") processes such as the DCT
("Discrete Cosine Transform") and/or IDCT ("Inverse Discrete Cosine Transform"),
using a performance optimized method and associated hardware apparatus.

BACKGROUND INFORMATION 

Digital signal processing ("DSP") and information theory technology are
essential to modern information processing and telecommunications, for both the
efficient storage and the efficient transmission of data. In particular, effective
multimedia communications, including speech, audio and video, rely on efficient
methods and structures for compressing the multimedia data in order to conserve
bandwidth on the transmission channel and to reduce storage requirements.

Many DSP algorithms rely on transform kernels such as an FFT ("Fast Fourier 
Transform"), DCT ("Discrete Cosine Transform"), etc. For example, the discrete 
cosine transform ("DCT") has become a very widely used component in performing 
compression of multimedia information, in particular video information. The DCT is 
a loss-less mathematical transformation that converts a spatial or time representation 
of a signal into a frequency representation. The DCT offers attractive properties for 
converting between spatial/time domain and frequency representations of signals as 
opposed to other transforms such as the DFT ("Discrete Fourier Transform")/FFT. In 
particular, the kernel of the transform is real, reducing the complexity of processor 
calculations that must be performed. In addition, a significant advantage of the DCT 
for compression is that it exhibits an energy compaction property, wherein the signal 
energy in the transform domain is concentrated in low frequency components, while 
higher frequency components are typically much smaller in magnitude, and may often
be discarded. The DCT is in fact asymptotic to the statistically optimal
Karhunen-Loeve transform ("KLT") for Markov signals of all orders. Since its
introduction in 1974, the DCT has been used in many applications such as filtering,
transmultiplexers, speech coding, image coding (still frame, video and image storage),
pattern recognition, image enhancement and SAR/IR image coding. The DCT has
played an important role in commercial applications involving DSP; most notably, it
has been adopted by MPEG ("Moving Picture Experts Group") for use in the MPEG-2 and
MPEG-4 video compression algorithms.
A computation that is common in digital filters such as finite impulse response
("FIR") filters or linear transformations such as the DFT and DCT may be expressed
mathematically by the following dot-product equation:

$$d = \sum_{i=0}^{N-1} a(i) * b(i)$$

where a(i) are the input data, b(i) are the filter coefficients (taps) and d is the output.
Typically a multiply-accumulator ("MAC") is employed in traditional DSP design in
order to accelerate this type of computation. A MAC kernel can be described by the
following equation:

$$d^{[i+1]} = d^{[i]} + a(i) * b(i), \quad \text{with initial value } d^{[0]} = 0.$$


BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1a is a block diagram of a video encoding system.

FIG. 1b is a block diagram of a video decoding system.

FIG. 2 is a block diagram of a datapath for computing a 2-D IDCT.

FIG. 3 is a block diagram illustrating the operation of a MAC kernel.

FIG. 4 is a block diagram illustrating the operation of a MAAC kernel
according to one embodiment of the present invention.

FIG. 5 illustrates a paradigm for improving computational processes utilizing
a MAAC kernel according to one embodiment of the present invention.

FIG. 6 is a block diagram of a hardware architecture for computing an eight
point IDCT utilizing a MAAC kernel according to one embodiment of the present
invention.

FIG. 7 is a block diagram of a datapath for computation of an 8-point IDCT
utilizing the method of the present invention and a number of MAAC kernel
components according to one embodiment of the present invention.

FIG. 8 is a block diagram illustrating the operation of an AMAAC kernel
according to one embodiment of the present invention.

DETAILED DESCRIPTION 

FIG. 1a is a block diagram of a video encoding system. Video encoding
system 123 includes DCT block 111, quantization block 113, inverse quantization
block 115, IDCT block 140, motion compensation block 150, frame memory block
160, motion estimation block 121 and VLC ("Variable Length Coder") block 131. Input
video is received in digitized form. Together with one or more reference video data
from the frame memory, the input video is provided to motion estimation block 121,
where a motion estimation process is performed. The output of motion estimation
block 121, containing motion information such as motion vectors, is transferred to
motion compensation block 150 and VLC block 131. Using the motion vectors and one
or more reference video data, motion compensation block 150 performs a motion
compensation process to generate motion prediction results. At adder 170a, the
motion prediction results from motion compensation block 150 are subtracted from
the input video.

The output of adder 170a is provided to DCT block 111, where a DCT is computed.
The output of the DCT is provided to quantization block 113, where the frequency
coefficients are quantized and then transmitted to VLC block 131, where a variable
length coding process (e.g., Huffman coding) is performed. Motion information from
motion estimation block 121 and quantized indices of DCT coefficients from
quantization block 113 are provided to VLC block 131. The output of VLC block 131
is the compressed video data output from video encoder 123. The output of
quantization block 113 is also transmitted to inverse quantization block 115, where
an inverse quantization process is performed.

The output of inverse quantization block 115 is provided to IDCT block 140,
where an IDCT is performed. The output of IDCT block 140 is summed at adder 170b
with the motion prediction results from motion compensation block 150. The output
of adder 170b is reconstructed video data and is stored in frame memory block 160
to serve as reference data for the encoding of future video data.
FIG. 1b is a block diagram of a video decoding system. Video decoding
system 125 includes variable length decoder ("VLD") block 110, inverse scan ("IS")
block 120, inverse quantization ("IQ") block 130, IDCT block 140, frame memory
block 160, motion compensation block 150 and adder 170. A compressed video
bitstream is received by VLD block 110 and decoded. The decoded symbols are
converted into quantized indices of DCT coefficients and their associated
sequential locations in a particular scanning order. The sequential locations are
then converted into frequency-domain locations by the IS block 120. The quantized
indices of DCT coefficients are converted to DCT coefficients by the IQ block 130.
The DCT coefficients are received by IDCT block 140 and transformed. The output
from the IDCT is then combined with the output of motion compensation block 150 by
the adder 170. The motion compensation block 150 may reconstruct individual
pictures based upon the changes from one picture to its reference picture(s). Data
from the reference picture(s), a previous one or a future one or both, may be
stored in a temporary frame memory block 160, such as a frame buffer, and may be
used as the references. The motion compensation block 150 uses the motion vectors
decoded from the VLD 110 to determine how the current picture in the sequence
changes from the reference picture(s). The output of the motion compensation block
150 is the motion prediction data. The motion prediction data is added to the
output of the IDCT 140 by the adder 170. The output from the adder 170 is then
clipped (not shown) to become the reconstructed video data.

FIG. 2 is a block diagram of a datapath for computing a 2-dimensional (2-D)
IDCT according to one embodiment of the present invention. It includes a data
multiplexer 205, a 1-D IDCT block 210, a data demultiplexer 207 and a transpose
storage unit 220. Incoming data from IQ is processed in two passes through the IDCT.
In the first pass, the IDCT block 210 is configured to perform a 1-D IDCT along the
vertical direction. In this pass, data from IQ is selected by the multiplexer 205
and processed by the 1-D IDCT block 210. The output from IDCT block 210 is an
intermediate result that is routed by the demultiplexer 207 to be stored in the
transpose storage unit 220. In the second pass, IDCT block 210 is configured to
perform a 1-D IDCT along the horizontal direction. As such, the intermediate data
stored in the transpose storage unit 220 is selected by multiplexer 205 and
processed by the 1-D IDCT block 210. Demultiplexer 207 outputs the results from the
1-D IDCT block as the final result of the 2-D IDCT.
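
The two-pass flow of FIG. 2 can be sketched in software as follows. This is a minimal illustration only, not the hardware datapath: the function names are hypothetical, the 1-D transform is evaluated directly from its cosine definition with a normalization factor of 1/sqrt(2) for index 0 and 1 otherwise, and the storage unit between the passes is modeled as a plain list transpose.

```python
import math

def idct_1d(y):
    """Direct 1-D IDCT of a list y, using the scaling sqrt(2/M)."""
    m = len(y)
    out = []
    for i in range(m):
        acc = 0.0
        for k in range(m):
            norm = 1 / math.sqrt(2) if k == 0 else 1.0
            acc += math.sqrt(2 / m) * norm * y[k] * math.cos(
                (2 * i + 1) * k * math.pi / (2 * m))
        out.append(acc)
    return out

def transpose(block):
    """Swap rows and columns of a list-of-lists block."""
    return [list(row) for row in zip(*block)]

def idct_2d(block):
    """2-D IDCT computed as two passes through the same 1-D routine."""
    # First pass: 1-D IDCT along the vertical direction (one column at a time).
    columns = [idct_1d(col) for col in transpose(block)]
    # Storage between the passes: the results are indexed [column][row],
    # so they are transposed back to [row][column] before the second pass.
    intermediate = transpose(columns)
    # Second pass: 1-D IDCT along the horizontal direction (one row at a time).
    return [idct_1d(row) for row in intermediate]

# A block containing only a DC coefficient reconstructs to a flat block.
dc_block = [[0.0] * 8 for _ in range(8)]
dc_block[0][0] = 8.0
flat = idct_2d(dc_block)
```

Reusing a single 1-D routine for both passes mirrors the way the single IDCT block 210 is reconfigured between the vertical and horizontal directions.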
Many computational processes such as the transforms described above (e.g.,
DCT, IDCT, DFT, etc.) and filtering operations rely upon a multiply and accumulate
kernel. That is, the algorithms are effectively performed utilizing one or more
multiply and accumulate components, typically implemented as specialized hardware
on a DSP or other computer chip. The commonality of the MAC nature of these
processes has resulted in the development of particular digital logic and circuit
structures to carry out multiply and accumulate processes. In particular, a
fundamental component of any DSP chip today is the MAC unit.

FIG. 3 is a block diagram illustrating the operation of a MAC kernel.
Multiplier 310 performs multiplication of input datum a(i) and filter coefficient b(i),
the result of which is passed to adder 320. Adder 320 adds the result of multiplier 310
to accumulated output $d^{[i]}$, which was previously stored in register 330. The output
of adder 320 ($d^{[i+1]}$) is then stored in register 330. Typically a MAC output is
generated on each clock cycle.
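
The MAC recurrence can be modeled in a few lines of software. The sketch below is illustrative only; the function name `mac` and the sample values are assumptions, and each loop iteration stands in for one clock cycle of the kernel of FIG. 3.

```python
def mac(a, b):
    """Accumulate a(i)*b(i) one product per cycle, starting from d[0] = 0."""
    d = 0  # models register 330, which holds the running sum
    for a_i, b_i in zip(a, b):
        product = a_i * b_i  # models multiplier 310
        d = d + product      # models adder 320; result is written back to the register
    return d

# Hypothetical input data and filter taps.
samples = [1, 2, 3, 4]
taps = [5, 6, 7, 8]
result = mac(samples, taps)  # four cycles, one per tap
```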

The present invention provides a method and system for optimized numerical
computations. The present invention is particularly suitable for multiply accumulate
processes such as transforms, linear filtering operations, etc. One embodiment
described herein relates to the application of the MAAC architecture to a more
efficient IDCT computation. However, the present invention may be applied in any
multiply accumulate process such as a DCT, DFT, FFT, digital filter, etc., and the
embodiments described herein are not intended to limit the scope of the claims
appended hereto.

In particular, the present invention provides for efficient computation of a
class of expressions of the form

$$d = \sum_{i=0}^{N-1} a(i) * b(i).$$

In order to improve the efficiency of this class of computation, the invention
utilizes a new computational architecture, herein referred to as the MAAC
architecture, and an AMAAC architecture, which provide for more efficient execution
of this class of computational processes.

According to one embodiment, the present invention provides a method and
system for efficient and optimized DCT/IDCT computations by capitalizing upon the
novel algorithm realized through two new architectural component kernels
specifically adapted for performing DSP operations such as the DCT and IDCT. In
particular, the present invention provides a MAAC ("Multiply-Add-Accumulator")
kernel and an AMAAC ("Add-Multiply-Add-Accumulator") kernel, which specifically
capitalize upon the new algorithm described herein.

The present invention provides a new accumulator architecture, herein referred
to as the MAAC kernel. The MAAC kernel can be described by the following
recursive equation:

$$d^{[i+1]} = d^{[i]} + a(i) * b(i) + c(i), \quad \text{with initial value } d^{[0]} = 0.$$

FIG. 4 is a block diagram illustrating the operation of a MAAC kernel
according to one embodiment of the present invention. MAAC kernel 405 includes
multiplier 310, adder 320 and register 330. Multiplier 310 performs multiplication of
input datum a(i) and filter coefficient b(i), the result of which is passed to adder 320.
Adder 320 adds the result of multiplier 310 to a second input term c(i) along with
accumulated output $d^{[i]}$, which was previously stored in register 330. The output of
adder 320 ($d^{[i+1]}$) is then stored in register 330.

As an additional addition (c(i)) is performed each cycle, the MAAC kernel
will have higher performance throughput for some classes of computations. For
example, the throughput of a digital filter with some filter coefficients equal to one
can be improved utilizing the MAAC architecture depicted in FIG. 4.
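
The throughput benefit for a filter with unity coefficients can be illustrated with a small software model. This is a sketch under stated assumptions: the function name `maac`, the 4-tap filter and its values are hypothetical, and each loop iteration stands in for one clock cycle.

```python
def maac(a, b, c):
    """d[i+1] = d[i] + a(i)*b(i) + c(i), with d[0] = 0."""
    d = 0
    for a_i, b_i, c_i in zip(a, b, c):
        d = d + a_i * b_i + c_i  # one multiplication and two additions per cycle
    return d

# Hypothetical 4-tap filter whose last two coefficients equal one.
x = [1, 2, 3, 4]
taps = [5, 6, 1, 1]

# MAC-style evaluation uses one cycle per tap (4 cycles).
mac_result = sum(x_i * t_i for x_i, t_i in zip(x, taps))

# MAAC evaluation: samples with unity coefficients are fed to the c input,
# so the same sum is formed in 2 cycles.
maac_result = maac(a=x[:2], b=taps[:2], c=x[2:])
```

Both evaluations give the same filter output; the MAAC version simply folds each unity-coefficient sample into a cycle that is already performing a multiplication.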

FIG. 5 illustrates a paradigm for improving computational processes utilizing
a MAAC kernel according to one embodiment of the present invention. In step 510, an
expression for a particular computation is determined. Typically, the computation is
expressed as a linear combination of input elements a(i), each scaled by a respective
coefficient b(i). That is, the present invention provides for improved efficiency of
performance for computational problems that may be expressed in the general form:

$$d = \sum_{i=0}^{N-1} a(i) * b(i)$$

where a(i) are the input data, b(i) are coefficients and d is the output. As noted above,
utilizing a traditional MAC architecture, output d may be computed utilizing a kernel
of the form:

$$d^{[i+1]} = d^{[i]} + a(i) * b(i), \quad \text{with initial value } d^{[0]} = 0.$$

This type of computation occurs very frequently in many applications,
including digital signal processing, digital filtering, etc.

In step 520, a common factor c is factored out of the expression, obtaining the
following expression:

$$d = c \sum_{i=0}^{N-1} a(i) * b'(i), \quad \text{where } b(i) = c\,b'(i).$$

If as a result of factoring the common factor c, some of the coefficients b'(i)
are unity, then the following result is obtained:

$$d = c\left(\sum_{i=0}^{M-1} a(i) * b'(i) + \sum_{i=M}^{N-1} a(i)\right), \quad \text{where } \{b'(i) = 1 : M \le i \le N-1\}.$$

This may be effected, for example, by factoring a matrix expression such that
certain matrix entries are '1'. The above expression lends itself to use of the MAAC
kernel described above by the recursive equation:

$$d^{[i+1]} = d^{[i]} + a(i) * b(i) + c(i), \quad \text{with initial value } d^{[0]} = 0.$$

In this form the computation utilizes at least one addition per cycle due to the unity
coefficients.
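
As an illustrative worked example (the coefficient values are hypothetical and chosen only to keep the arithmetic simple), take N = 4 with coefficients b = (6, 10, 2, 2). Factoring out the common factor 2 leaves b' = (3, 5, 1, 1), so M = 2 and the last two inputs have unity coefficients:

$$
\begin{aligned}
d &= 6\,a(0) + 10\,a(1) + 2\,a(2) + 2\,a(3) \\
  &= 2\,\bigl(3\,a(0) + 5\,a(1) + a(2) + a(3)\bigr) \\
\text{cycle 1:}\quad d^{[1]} &= 0 + a(0) * 3 + a(2) \\
\text{cycle 2:}\quad d^{[2]} &= d^{[1]} + a(1) * 5 + a(3) \\
d &= 2\,d^{[2]}
\end{aligned}
$$

Under these assumptions the bracketed sum is formed in two MAAC cycles, where a MAC kernel would take four, and the factor 2 is applied once at the end.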

In step 530, based upon the re-expression of the computational process
accomplished in step 520, one or more MAAC kernels are arranged in a configuration
to carry out the computational process as represented in its re-expressed form.

The paradigm depicted in FIG. 5 is particularly useful for multiply and
accumulate computational processes. According to one embodiment described
herein, the method of the present invention is applied to provide a more efficient
IDCT computation, which is a multiply and accumulate process typically carried out
using a plurality of MAC kernels. However, the present invention may be applied to
any type of computational process, not only MAC processes.

According to one embodiment, the present invention is applied to the IDCT in
order to reduce computational complexity and improve efficiency. In particular, the
number of clock cycles required in a particular hardware implementation to carry
out the IDCT is reduced significantly.

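The 2-D DCT may be expressed as follows:

$$y_{kl} = \sqrt{\frac{2}{M}}\,\alpha(k)\,\sqrt{\frac{2}{N}}\,\alpha(l) \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} x_{ij} \cos\frac{(2i+1)k\pi}{2M} \cos\frac{(2j+1)l\pi}{2N}$$

$$\text{where } \alpha(k) = \begin{cases} \dfrac{1}{\sqrt{2}}, & \text{if } k = 0 \\ 1, & \text{otherwise} \end{cases}$$

The 2-D IDCT may be expressed as follows:

$$x_{ij} = \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} y_{kl}\,\sqrt{\frac{2}{M}}\,\alpha(k)\,\sqrt{\frac{2}{N}}\,\alpha(l) \cos\frac{(2i+1)k\pi}{2M} \cos\frac{(2j+1)l\pi}{2N}$$

$$\text{where } \alpha(k) = \begin{cases} \dfrac{1}{\sqrt{2}}, & \text{if } k = 0 \\ 1, & \text{otherwise} \end{cases}$$

The 2-D DCT and IDCT are separable and may be factored as follows:

$$x_{ij} = \sum_{k=0}^{M-1} z_{kj}\,e_i(k) \quad \text{for } i = 0, 1, \ldots, M-1 \text{ and } j = 0, 1, \ldots, N-1$$

where the intermediate 1-D IDCT data are:

$$z_{kj} = \sum_{l=0}^{N-1} y_{kl}\,e_j(l) \quad \text{for } k = 0, 1, \ldots, M-1 \text{ and } j = 0, 1, \ldots, N-1$$

and the DCT basis vectors $e_i(k)$ are:

$$e_i(k) = \sqrt{\frac{2}{M}}\,\alpha(k) \cos\frac{(2i+1)k\pi}{2M} \quad \text{for } i, k = 0, 1, \ldots, M-1$$
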

A fast algorithm for calculating the IDCT (Chen) capitalizes on the cyclic property of
the transform basis function (the cosine function). For example, for an eight point
IDCT, the basis function only assumes 8 different values, each of which may appear
with a positive or a negative sign, as shown in the following table:



j/l     0      1      2      3      4      5      6      7
0     c(0)   c(1)   c(2)   c(3)   c(4)   c(5)   c(6)   c(7)
1     c(0)   c(3)   c(6)  -c(7)  -c(4)  -c(1)  -c(2)  -c(5)
2     c(0)   c(5)  -c(6)  -c(1)  -c(4)   c(7)   c(2)   c(3)
3     c(0)   c(7)  -c(2)  -c(5)   c(4)   c(3)  -c(6)  -c(1)
4     c(0)  -c(7)  -c(2)   c(5)   c(4)  -c(3)  -c(6)   c(1)
5     c(0)  -c(5)  -c(6)   c(1)  -c(4)  -c(7)   c(2)  -c(3)
6     c(0)  -c(3)   c(6)   c(7)  -c(4)   c(1)  -c(2)   c(5)
7     c(0)  -c(1)   c(2)  -c(3)   c(4)  -c(5)   c(6)  -c(7)
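where c(m) denotes the following basis terms, indexed by m = 0, 1, ..., 7:

$$
\begin{aligned}
c(m) &= \alpha(m) \cos\frac{m\pi}{16} \\
&= \left\{ \frac{1}{\sqrt{2}},\ \cos\frac{\pi}{16},\ \cos\frac{\pi}{8},\ \cos\frac{3\pi}{16},\ \cos\frac{\pi}{4},\ \cos\frac{5\pi}{16},\ \cos\frac{3\pi}{8},\ \cos\frac{7\pi}{16} \right\} \\
&= \left\{ \cos\frac{\pi}{4},\ \cos\frac{\pi}{16},\ \cos\frac{\pi}{8},\ \cos\frac{3\pi}{16},\ \cos\frac{\pi}{4},\ \sin\frac{3\pi}{16},\ \sin\frac{\pi}{8},\ \sin\frac{\pi}{16} \right\}
\end{aligned}
$$
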
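
The sign and index pattern of the eight-point table follows directly from the periodicity and symmetry of the cosine. The sketch below regenerates each table entry as a (sign, m) pair; the function names are hypothetical, and the angle reduction is valid only for row and column indices in the range 0 to 7.

```python
import math

def basis_entry(i, k):
    """Return (sign, m) such that the basis value for row i, column k is sign * c(m)."""
    angle = ((2 * i + 1) * k) % 32  # angles in units of pi/16; the period 2*pi is 32 units
    if angle > 16:
        angle = 32 - angle          # cos(2*pi - t) = cos(t)
    if angle > 8:
        return -1, 16 - angle       # cos(pi - t) = -cos(t)
    return 1, angle

def c(m):
    """Basis term c(m): 1/sqrt(2) for m = 0, cos(m*pi/16) otherwise."""
    return 1 / math.sqrt(2) if m == 0 else math.cos(m * math.pi / 16)

# table[i][k] is the entry in row i, column k of the eight-point table.
table = [[basis_entry(i, k) for k in range(8)] for i in range(8)]
```

For instance, `table[1]` lists the entries of row 1 as signed indices, which can be compared against the second row of the table above.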



The cyclical nature of the IDCT shown in the above table provides the following
relationship between output terms of the 1-D IDCT:

$$\frac{x_i + x_{7-i}}{2} = e_i(0)\,y_0 + e_i(2)\,y_2 + e_i(4)\,y_4 + e_i(6)\,y_6$$

$$\frac{x_i - x_{7-i}}{2} = e_i(1)\,y_1 + e_i(3)\,y_3 + e_i(5)\,y_5 + e_i(7)\,y_7$$

where the basis terms $e_i(k)$ have sign and value mapped to the DCT basis terms c(m)
according to the relationship:

$$e_i(k) = \pm\frac{1}{2}\,c(m(i,k))$$

For a 4-point IDCT, the basis terms have the same symmetrical property, as
shown in the following table:

j/l     0      1      2      3
0     c(0)   c(2)   c(4)   c(6)
1     c(0)   c(6)  -c(4)  -c(2)
2     c(0)  -c(6)  -c(4)   c(2)
3     c(0)  -c(2)   c(4)  -c(6)

The corresponding equations are:

$$\frac{x_i + x_{3-i}}{2} = e_i(0)\,y_0 + e_i(4)\,y_2$$

$$\frac{x_i - x_{3-i}}{2} = e_i(2)\,y_1 + e_i(6)\,y_3$$

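Based upon the above derivation, a 1-D 8-point IDCT can be represented by
the following matrix vector equations:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = \frac{1}{2} A \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix} + \frac{1}{2} B \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix}$$

$$\begin{bmatrix} x_7 \\ x_6 \\ x_5 \\ x_4 \end{bmatrix} = \frac{1}{2} A \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix} - \frac{1}{2} B \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix}$$

where:

$$A = \begin{bmatrix} c(0) & c(4) & c(2) & c(6) \\ c(0) & -c(4) & c(6) & -c(2) \\ c(0) & -c(4) & -c(6) & c(2) \\ c(0) & c(4) & -c(2) & -c(6) \end{bmatrix}
\qquad
B = \begin{bmatrix} c(1) & c(5) & c(3) & c(7) \\ c(3) & -c(1) & -c(7) & -c(5) \\ c(5) & c(7) & -c(1) & c(3) \\ c(7) & c(3) & -c(5) & -c(1) \end{bmatrix}$$

and $c(0) = \cos\left(\frac{\pi}{4}\right)$ and $c(n) = \cos\left(\frac{n\pi}{16}\right)$ (n = 1, 2, 3, 4, 5, 6, 7).

Note that $A^{-1} = \frac{1}{2} A^{T}$ and $B^{-1} = \frac{1}{2} B^{T}$.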

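Using the paradigm depicted in FIG. 5, a common factor may be factored from
the matrix equation above such that certain coefficients are unity. The unity
coefficients then allow for the introduction of a number of MAAC kernels in a
computational architecture, thereby reducing the number of clock cycles required to
carry out the IDCT. In particular, by factoring $c(0) = c(4) = \frac{1}{\sqrt{2}}$ out from the
matrix vector equation above, the following equations are obtained:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = \frac{1}{\sqrt{2}} \left( \frac{1}{2} A' \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix} + \frac{1}{2} B' \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix} \right)$$

$$\begin{bmatrix} x_7 \\ x_6 \\ x_5 \\ x_4 \end{bmatrix} = \frac{1}{\sqrt{2}} \left( \frac{1}{2} A' \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix} - \frac{1}{2} B' \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix} \right)$$

where:

$$A' = \begin{bmatrix} 1 & 1 & c'(2) & c'(6) \\ 1 & -1 & c'(6) & -c'(2) \\ 1 & -1 & -c'(6) & c'(2) \\ 1 & 1 & -c'(2) & -c'(6) \end{bmatrix}
\qquad
B' = \begin{bmatrix} c'(1) & c'(5) & c'(3) & c'(7) \\ c'(3) & -c'(1) & -c'(7) & -c'(5) \\ c'(5) & c'(7) & -c'(1) & c'(3) \\ c'(7) & c'(3) & -c'(5) & -c'(1) \end{bmatrix}$$

and $c'(n) = c(n)/c(0) = \sqrt{2}\,c(n)$.

Because the factor $\frac{1}{\sqrt{2}}$ is factored out of the matrix vector equation and omitted
from each one-dimensional pass, the results after the two-dimensional operation
carry a scale factor of two. Dividing the final result by 2 after the
two-dimensional computation yields the correct transform.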

Note that the expression for the IDCT derived above incorporates multiple
instances of the generalized expression $d = \sum_{i=0}^{N-1} a(i) * b(i)$ re-expressed as

$$d = c\left(\sum_{i=0}^{M-1} a(i) * b'(i) + \sum_{i=M}^{N-1} a(i)\right), \quad \text{where } \{b'(i) = 1 : M \le i \le N-1\},$$

to which the present invention is addressed. This is a consequence of the nature of
matrix multiplication and may be seen as follows (unpacking the matrix
multiplication):

$$
\begin{aligned}
x_0 &= y_0 + y_4 + c'(2) y_2 + c'(6) y_6 + c'(1) y_1 + c'(5) y_5 + c'(3) y_3 + c'(7) y_7 \\
x_1 &= y_0 - y_4 + c'(6) y_2 - c'(2) y_6 + c'(3) y_1 - c'(1) y_5 - c'(7) y_3 - c'(5) y_7 \\
x_2 &= y_0 - y_4 - c'(6) y_2 + c'(2) y_6 + c'(5) y_1 + c'(7) y_5 - c'(1) y_3 + c'(3) y_7 \\
x_3 &= y_0 + y_4 - c'(2) y_2 - c'(6) y_6 + c'(7) y_1 + c'(3) y_5 - c'(5) y_3 - c'(1) y_7 \\
x_7 &= y_0 + y_4 + c'(2) y_2 + c'(6) y_6 - c'(1) y_1 - c'(5) y_5 - c'(3) y_3 - c'(7) y_7 \\
x_6 &= y_0 - y_4 + c'(6) y_2 - c'(2) y_6 - c'(3) y_1 + c'(1) y_5 + c'(7) y_3 + c'(5) y_7 \\
x_5 &= y_0 - y_4 - c'(6) y_2 + c'(2) y_6 - c'(5) y_1 - c'(7) y_5 + c'(1) y_3 - c'(3) y_7 \\
x_4 &= y_0 + y_4 - c'(2) y_2 - c'(6) y_6 - c'(7) y_1 - c'(3) y_5 + c'(5) y_3 + c'(1) y_7
\end{aligned}
$$

Note that the above expressions do not incorporate scale factors, which can be
applied at the end of the calculation simply as a right bit-shift.

FIG. 6 is a block diagram of a hardware architecture for computing an eight
point IDCT utilizing a MAAC kernel according to one embodiment of the present
invention. The hardware architecture of FIG. 6 may be incorporated into a larger
datapath for computation of an IDCT. As shown in FIG. 6, data loader 505 is coupled
to four dual MAAC kernels 405(1)-405(4), each dual MAAC kernel including two
MAAC kernels sharing a common multiplier. Note that the architecture depicted in
FIG. 6 is merely illustrative and is not intended to limit the scope of the claims
appended hereto. The operation of the hardware architecture depicted in FIG. 6 for
computing the IDCT will become evident with respect to the following discussion.

Utilizing the architecture depicted in FIG. 6, a 1-D 8-point IDCT can be
computed in 5 clock cycles as follows:
1st clock:
mult1 = c'(1) * y_1
mult2 = c'(3) * y_1
mult3 = c'(5) * y_1
mult4 = c'(7) * y_1

x_0(clk1) = y_0 + mult1 + 0
x_7(clk1) = y_0 - mult1 + 0
x_1(clk1) = y_0 + mult2 + 0
x_6(clk1) = y_0 - mult2 + 0
x_2(clk1) = y_0 + mult3 + 0
x_5(clk1) = y_0 - mult3 + 0
x_3(clk1) = y_0 + mult4 + 0
x_4(clk1) = y_0 - mult4 + 0

2nd clock:
mult1 = c'(5) * y_5
mult2 = -c'(1) * y_5
mult3 = c'(7) * y_5
mult4 = c'(3) * y_5

x_0(clk2) = y_4 + mult1 + x_0(clk1)
x_7(clk2) = y_4 - mult1 + x_7(clk1)
x_1(clk2) = -y_4 + mult2 + x_1(clk1)
x_6(clk2) = -y_4 - mult2 + x_6(clk1)
x_2(clk2) = -y_4 + mult3 + x_2(clk1)
x_5(clk2) = -y_4 - mult3 + x_5(clk1)
x_3(clk2) = y_4 + mult4 + x_3(clk1)
x_4(clk2) = y_4 - mult4 + x_4(clk1)

3rd clock:
mult1 = c'(3) * y_3
mult2 = -c'(7) * y_3
mult3 = -c'(1) * y_3
mult4 = -c'(5) * y_3

x_0(clk3) = 0 + mult1 + x_0(clk2)
x_7(clk3) = 0 - mult1 + x_7(clk2)
x_1(clk3) = 0 + mult2 + x_1(clk2)
x_6(clk3) = 0 - mult2 + x_6(clk2)
x_2(clk3) = 0 + mult3 + x_2(clk2)
x_5(clk3) = 0 - mult3 + x_5(clk2)
x_3(clk3) = 0 + mult4 + x_3(clk2)
x_4(clk3) = 0 - mult4 + x_4(clk2)

4th clock:
mult1 = c'(7) * y_7
mult2 = -c'(5) * y_7
mult3 = c'(3) * y_7
mult4 = -c'(1) * y_7

x_0(clk4) = 0 + mult1 + x_0(clk3)
x_7(clk4) = 0 - mult1 + x_7(clk3)
x_1(clk4) = 0 + mult2 + x_1(clk3)
x_6(clk4) = 0 - mult2 + x_6(clk3)
x_2(clk4) = 0 + mult3 + x_2(clk3)
x_5(clk4) = 0 - mult3 + x_5(clk3)
x_3(clk4) = 0 + mult4 + x_3(clk3)
x_4(clk4) = 0 - mult4 + x_4(clk3)

5th clock:
mult1 = c'(2) * y_2
mult2 = c'(6) * y_6
mult3 = c'(6) * y_2
mult4 = -c'(2) * y_6

x_0(clk5) = mult1 + mult2 + x_0(clk4)
x_7(clk5) = mult1 + mult2 + x_7(clk4)
x_1(clk5) = mult3 + mult4 + x_1(clk4)
x_6(clk5) = mult3 + mult4 + x_6(clk4)
x_2(clk5) = -mult3 - mult4 + x_2(clk4)
x_5(clk5) = -mult3 - mult4 + x_5(clk4)
x_3(clk5) = -mult1 - mult2 + x_3(clk4)
x_4(clk5) = -mult1 - mult2 + x_4(clk4)
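
A five-clock schedule of this kind can be checked numerically against the cosine definition of the 8-point IDCT. The sketch below is a software model only, under stated assumptions: the function names are hypothetical, the multiplier coefficients are taken as sqrt(2)*cos(n*pi/16) with signs following the cosine basis table, each pair of outputs shares one multiplier per clock, and the final scale of 1/(2*sqrt(2)) is applied separately.

```python
import math

def cp(n):
    """Scaled basis term c'(n) = sqrt(2) * cos(n * pi / 16)."""
    return math.sqrt(2) * math.cos(n * math.pi / 16)

# Coefficient applied by multiplier p (rows) to y1, y5, y3, y7 (clocks 1 to 4).
B_PRIME = [
    [cp(1),  cp(5),  cp(3),  cp(7)],
    [cp(3), -cp(1), -cp(7), -cp(5)],
    [cp(5),  cp(7), -cp(1),  cp(3)],
    [cp(7),  cp(3), -cp(5), -cp(1)],
]

def idct8_schedule(y):
    """Return x0..x7 after five clocks, without the final 1/(2*sqrt(2)) scale."""
    upper = [0.0] * 4  # accumulators for x0, x1, x2, x3
    lower = [0.0] * 4  # accumulators for x7, x6, x5, x4
    odd_inputs = [y[1], y[5], y[3], y[7]]
    # Add-only inputs per clock: y0 for every output, then +/- y4, then zero.
    addends = [[y[0]] * 4, [y[4], -y[4], -y[4], y[4]], [0.0] * 4, [0.0] * 4]
    for clk in range(4):
        for p in range(4):
            mult = B_PRIME[p][clk] * odd_inputs[clk]  # multiplier shared by a pair
            upper[p] += addends[clk][p] + mult
            lower[p] += addends[clk][p] - mult
    # Clock 5: the even inputs y2 and y6 occupy all four multipliers.
    mult1, mult2 = cp(2) * y[2], cp(6) * y[6]
    mult3, mult4 = cp(6) * y[2], -cp(2) * y[6]
    even = [mult1 + mult2, mult3 + mult4, -mult3 - mult4, -mult1 - mult2]
    for p in range(4):
        upper[p] += even[p]
        lower[p] += even[p]
    return upper + lower[::-1]  # ordered x0, x1, ..., x7

def idct8_direct(y):
    """Reference 8-point IDCT evaluated from the cosine definition."""
    out = []
    for i in range(8):
        acc = 0.0
        for k in range(8):
            norm = 1 / math.sqrt(2) if k == 0 else 1.0
            acc += 0.5 * norm * y[k] * math.cos((2 * i + 1) * k * math.pi / 16)
        out.append(acc)
    return out

coeffs = [1.0, -2.0, 3.5, 0.25, -1.5, 4.0, 2.0, -0.75]  # arbitrary test input
scaled = [v / (2 * math.sqrt(2)) for v in idct8_schedule(coeffs)]
reference = idct8_direct(coeffs)
```

The lists `scaled` and `reference` can be compared element by element; in a two-dimensional computation the scale would instead be deferred to the end of the second pass.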



FIG. 7 is a block diagram of a datapath for computation of an 8-point IDCT
utilizing the method of the present invention and a number of MAAC kernel
components according to one embodiment of the present invention. Note that the
datapath shown in FIG. 7 includes four dual MAAC kernels 405(1)-405(4).

According to an alternative embodiment, the MAAC kernel is modified to 
include two additional additions, to produce a structure herein referred to as the 
AMAAC kernel. The AMAAC kernel can be described by the following recursive 
equation: 

$$d^{[i+1]} = d^{[i]} + [a(i) + e(i)] * b(i) + c(i), \quad \text{with initial value } d^{[0]} = 0.$$

FIG. 8 is a block diagram illustrating the operation of an AMAAC kernel
according to one embodiment of the present invention. AMAAC kernel 805 includes
multiplier 310, first adder 320a, second adder 320b and register 330. First adder 320a
adds a(i) and e(i). Multiplier 310 performs multiplication of input datum [a(i)+e(i)]
and filter coefficient b(i), the result of which is passed to adder 320b. Adder 320b
adds the result of multiplier 310 to a second input term c(i) along with accumulated
output $d^{[i]}$, which was previously stored in register 330. The output of adder 320b
($d^{[i+1]}$) is then stored in register 330.

As two more additions are performed during the same AMAAC cycle, the
AMAAC kernel has a higher performance throughput for some classes of computations.
For example, a digital filter in which some filter coefficients have equal value
can be sped up by the AMAAC kernel. Specifically, a(i), c(i), and e(i) can be
considered as input data and b(i) as filter coefficients. With inputs a(i) and e(i) having
the same filter coefficients b(i) and inputs c(i) with unity coefficients, all three groups
of inputs can be processed in parallel.
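
The following sketch models the AMAAC recurrence in software. It is illustrative only: the function name `amaac` and the 6-tap filter are hypothetical, chosen so that two taps repeat and two taps are unity, and each loop iteration stands in for one clock cycle.

```python
def amaac(a, e, b, c):
    """d[i+1] = d[i] + (a(i) + e(i)) * b(i) + c(i), with d[0] = 0."""
    d = 0
    for a_i, e_i, b_i, c_i in zip(a, e, b, c):
        pre_sum = a_i + e_i          # models first adder 320a
        d = d + pre_sum * b_i + c_i  # models multiplier 310 and second adder 320b
    return d

# Hypothetical 6-tap filter: taps 0 and 1 repeat as taps 2 and 3, taps 4 and 5 are unity.
x = [1, 2, 3, 4, 5, 6]
taps = [7, 8, 7, 8, 1, 1]

# Tap-by-tap evaluation (six multiply-accumulate steps).
reference = sum(x_i * t_i for x_i, t_i in zip(x, taps))

# AMAAC evaluation: samples sharing a coefficient are summed before the multiply,
# and unity-coefficient samples use the c input, giving the same result in 2 cycles.
result = amaac(a=x[0:2], e=x[2:4], b=taps[0:2], c=x[4:6])
```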
According to one embodiment, MAAC and AMAAC computational kernels
may be combined to generate a reconfigurable computation engine (for example, to
compute the IDCT). By allowing this reconfiguration, hardware logic gates can be
shared to improve performance without incurring additional cost. The AMAAC kernel
provides a structure for achieving more efficient downsampling computations.